R functions

Functions are an extremely important way to write concise code whenever we need to repeat the same “set of operations” multiple times.

Example:

fahrenheit_to_celsius <- function(temp_F){
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}
temp <- fahrenheit_to_celsius(32)
temp 
[1] 0

We can compose the result of different functions

celsius_to_kelvin <- function(temp_C){
  temp_K <- temp_C + 273.15
  return(temp_K)
}

The conversion from Fahrenheit to Kelvin can then be obtained via

celsius_to_kelvin(fahrenheit_to_celsius(100))
[1] 310.9278

You can set default inputs as

mySum <- function(num1, num2 = 10){
  out <- num1 + num2
  return(out)
}
mySum(2)
[1] 12
mySum(2, 5)
[1] 7

You can return multiple outputs as

my_operations <- function(num1, num2) {
  out <- num1 + num2
  out1 <- num1 * num2
  return(list('sum' = out, 'prod' = out1))
}
res <- my_operations(2, 5)
res$sum
[1] 7
res$prod
[1] 10

Remember the scoping rules: variables defined inside a function are not visible outside of it. On the other hand, the body of a function can see variables defined outside of it.
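
A minimal illustration of both rules (the names z, addZ, and w are made up for this example):

z <- 10                       # defined outside the function
addZ <- function(num){
  w <- num + z                # the body can see z
  return(w)
}
addZ(5)                       # returns 15
# asking for w at the top level now gives "Error: object 'w' not found",
# because w only exists inside the call to addZ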


Model syntax

Let us generate some data from a linear model.

set.seed(1234)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100) # rnorm(100) is the additive noise
plot(x, y, xlab = 'x', ylab = 'y', pch = 16)

Simple linear fit: we use the lm() function

fit <- lm(y ~ x) # y = \beta_0 + \beta_1 * x
summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.88626 -0.61401  0.00236  0.58645  2.98774 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.0372     0.1050    9.88   <2e-16 ***
x             1.9739     0.1038   19.02   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.037 on 98 degrees of freedom
Multiple R-squared:  0.7869,    Adjusted R-squared:  0.7847 
F-statistic: 361.8 on 1 and 98 DF,  p-value: < 2.2e-16

Remember how to interpret this table:

  • the t-test is telling us whether each single predictor is significant, i.e. it is testing \(H_0: \beta_j = 0\) versus \(H_a: \beta_j \neq 0\)
  • the F-test is telling us whether there is a relationship between the response and the predictors. It is testing jointly the significance of all the predictors, i.e. \(H_0: \beta_1 = \dots = \beta_p = 0\) versus \(H_a: \exists \beta_j \neq 0\)
  • the residual standard error is the estimate of the residual standard deviation and can be calculated via \(s = \sqrt{RSS / (n-2)}\), where \(RSS = e_1^2 + \dots + e_n^2\) is the residual sum of squares (verified in the sketch below)
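
As a quick check (a sketch, assuming the fit object from the simulation above is still in memory), we can recompute the residual standard error by hand:

rss <- sum(residuals(fit)^2)   # residual sum of squares
sqrt(rss / df.residual(fit))   # df.residual(fit) equals n - 2 here
summary(fit)$sigma             # both should equal 1.037, the value reported above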

There are (at least) two ways to plot the regression line:

  1. we create a fine grid of \(x\) values and we use the function predict()
  2. we use abline() and input the regression coefficients

Method 1:

xgrid <- seq(min(x), max(x), length.out = 100)
beta_hat <- coef(fit)

plot(x, y, xlab = 'x', ylab = 'y', pch = 16)
lines(xgrid, as.numeric(predict(fit, newdata = list(x = xgrid))), col = cols[1], lwd = 3)
points(mean(x), mean(y), col = cols[2], cex = 1.5, pch = 16)

Method 2:

plot(x, y, xlab = 'x', ylab = 'y', pch = 16)
abline(coef(fit), col = cols[1], lwd = 3)
points(mean(x), mean(y), col = cols[2], cex = 1.5, pch = 16)

The advantage of method 1 is that it generalizes to any statistical model.

Exercise

Let’s take a look at the values of \(\hat{\beta}_0, \hat{\beta}_1\). We could have just computed them by hand, right? Remember that \[\hat{\beta}_1 = r_{xy} \frac{s_y}{s_x}, \quad \hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}.\]


my_beta_hat <- c(mean(y) - cor(x, y) * sd(y)/sd(x) * mean(x), cor(x, y) * sd(y)/sd(x))
print(cbind(beta_hat, my_beta_hat))
            beta_hat my_beta_hat
(Intercept) 1.037154    1.037154
x           1.973915    1.973915

What happens if we replicate the data generating mechanism?

B <- 1000
betas <- array(NA, dim = c(B, 2))
plot(x, y, xlab = 'x', ylab = 'y', pch = 16)
for (b in 1:B){
  x <- rnorm(100)
  y <- 1 + 2 * x + rnorm(100)
  fit <- lm(y ~ x)
  betas[b,] <- coef(fit)
  abline(fit, col = alpha(cols[1], 0.05))
}
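
The stored coefficients give a Monte Carlo picture of the sampling distribution of \(\hat{\beta}_0, \hat{\beta}_1\). A short sketch, to be run after the loop above:

par(mfrow = c(1, 2))
hist(betas[, 1], main = '', xlab = expression(hat(beta)[0]))
hist(betas[, 2], main = '', xlab = expression(hat(beta)[1]))
colMeans(betas)       # should be close to the true values (1, 2)
apply(betas, 2, sd)   # Monte Carlo approximation of the standard errors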

More complicated models:

y ~ x         # y = beta_0 + beta_1 * x
y ~ 0 + x     # y = beta_1 * x (no intercept)
y ~ x1 + x2   # y = beta_0 + beta_1 * x1 + beta_2 * x2
y ~ x1 * x2   # y = beta_0 + beta_1 * x1 + beta_2 * x2 + beta_3 * x1 * x2
y ~ I(x / 2)  # y = beta_0 + beta_1 * (x / 2)

Note that I() evaluates its argument as a regular R expression, so arithmetic operators inside it keep their usual meaning. Also, you should know that the lm() function is not doing anything fancy: it simply solves the least-squares problem efficiently and reports a number of test statistics. In principle, you could code your own LinearModel function. You just have to remember that

\[\widehat{\boldsymbol{\beta}} = (X^{\intercal} X)^{-1} X^{\intercal} \mathbf{y},\]

which is the multivariate equivalent of the formulas that we have seen in class.
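
For instance, here is a bare-bones sketch of such a function (the name my_lm is made up, and it returns only the coefficient estimates):

my_lm <- function(X, y){
  # X is the design matrix, including the column of 1's for the intercept
  beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solve the normal equations
  return(as.numeric(beta_hat))
}
my_lm(cbind(1, x), y)   # should match coef(lm(y ~ x)) for the simulated data above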


Simple linear regression

The goal is to use the dataset Boston to predict median house value using 13 predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status). We first start with a univariate regression in which the only covariate is lstat. As usual, let us start by loading the libraries that we will use.

rm(list=ls())
setwd("~/Desktop/SDS323 - Fa20/Laboratories/Lab 2 - Regression/")
library(ggplot2)
library(latex2exp)
library(RColorBrewer)
library(ISLR)
library(MASS)
library(car)
library(plotrix)
theme_set(theme_bw(base_size = 14))
cols <- brewer.pal(9, "Set1")
names(Boston)
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
 [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"   
?Boston # get information on the data set

fit1 <- lm(medv ~ lstat, data = Boston) 
summary(fit1) # summary of the linear model fit is more useful

Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
coef(fit1)
(Intercept)       lstat 
 34.5538409  -0.9500494 
confint(fit1) # CI for the intercept and slope
                2.5 %     97.5 %
(Intercept) 33.448457 35.6592247
lstat       -1.026148 -0.8739505
par(mar=c(4,4,2,2))
plotCI(x = 1:2, y = coef(fit1), ui = confint(fit1)[,2], li = confint(fit1)[,1], 
       lwd = 2, xaxt = 'n', xlab = '', ylab = 'value', pch = 16)
axis(1, at = 1:2, labels = c(TeX("$\\beta_0$"), TeX("$\\beta_1$")))
abline(h = 0, lwd = 2, lty = 2, col = cols[1])

We can make predictions at new x values using either a CI or a PI. Remember the difference? The CI is for the mean response \(\mathbb{E}[Y \mid X]\), whereas the PI is for a new observation \(Y^\star\) and is therefore wider.

predict(fit1, data.frame(lstat = c(5,10,15)), interval = "confidence")
       fit      lwr      upr
1 29.80359 29.00741 30.59978
2 25.05335 24.47413 25.63256
3 20.30310 19.73159 20.87461
predict(fit1, data.frame(lstat = c(5,10,15)), interval = "prediction")
       fit       lwr      upr
1 29.80359 17.565675 42.04151
2 25.05335 12.827626 37.27907
3 20.30310  8.077742 32.52846

Let us show the CI for the regression line, i.e. for the quantity \(\mathbb{E}[Y \mid X]\)

xgrid <- seq(min(Boston$lstat), max(Boston$lstat), length.out = 100)
plot(Boston$lstat, Boston$medv, pch = 16, xlab = 'lstat', ylab = 'medv') # scatter plot of medv and lstat
abline(fit1, col = cols[1], lwd = 3) # add the regression line in the scatter plot
y_hat <- predict(fit1, data.frame(lstat = xgrid), interval = "confidence") 
polygon(c(xgrid,rev(xgrid)), c(y_hat[,'lwr'],rev(y_hat[,'upr'])), 
        col = alpha('gray', 0.5), border = alpha('gray', 0.5))

This is very different from the PI for a new observation, i.e. for the quantity \(Y^\star\), which is much wider.

plot(Boston$lstat, Boston$medv, pch = 16, xlab = 'lstat', ylab = 'medv')
abline(fit1, col = cols[1], lwd = 3)
y_hat <- predict(fit1, data.frame(lstat = xgrid), interval = "prediction") 
polygon(c(xgrid,rev(xgrid)), c(y_hat[,'lwr'],rev(y_hat[,'upr'])), 
        col = alpha('gray', 0.5), border = alpha('gray', 0.5))

We have already noted that some data points seem to be problematic.

par(mfrow = c(2,2))
plot(fit1) # make residual plot 

par(mfrow = c(1,1))
plot(hatvalues(fit1), pch = 16) # hat values measure leverage
abline(h = length(coef(fit1))/nrow(Boston), lwd = 2, lty = 2, col = cols[1])

Let us see how the fitted values compare to the observed ones

y_hat <- as.numeric(predict(fit1, data.frame(lstat = Boston$lstat)))
plot(Boston$medv, y_hat, pch = 16, xlab = 'observed', ylab = 'fitted')
abline(a = 0, b = 1, lty = 2, lwd = 2, col = cols[1])

The MSE (mean squared error) is

mean((y_hat - Boston$medv)^2)
[1] 38.48297

How would you compute the RSS?
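
One possibility (a sketch): the RSS is just the sum of the squared residuals, for instance

sum((Boston$medv - y_hat)^2)   # RSS, using the fitted values computed above
sum(residuals(fit1)^2)         # equivalently, directly from the lm object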


Multiple linear regression

Let us now use multiple predictors! Let’s first take a look at the data.

pairs(Boston)

This does not help much, but hopefully the regression model can shed light on some of these relationships. Before starting, we can take a look at the correlations in the data.

library(ggcorrplot)
ggcorrplot(round(cor(Boston), 2), hc.order = TRUE, type = "lower",
   lab = TRUE)

In the multivariate case, there are two equivalent ways of writing a linear model with lm().

  1. As before, we can input a dataframe (data = Boston) and use the regular syntax \(Y \sim X_1 + \dots + X_p\)
  2. Or, we can input the design matrix \(X\) and the outcome \(Y\). A very useful function to create the design matrix is model.matrix()
fit2 <- lm(medv ~ lstat + age, data = Boston) # same lm command for simple or multiple linear regression
summary(fit2)

Call:
lm(formula = medv ~ lstat + age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.981  -3.978  -1.283   1.968  23.158 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
age          0.03454    0.01223   2.826  0.00491 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.173 on 503 degrees of freedom
Multiple R-squared:  0.5513,    Adjusted R-squared:  0.5495 
F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16
X <- model.matrix(medv ~ lstat + age, data = Boston)
head(X)
  (Intercept) lstat  age
1           1  4.98 65.2
2           1  9.14 78.9
3           1  4.03 61.1
4           1  2.94 45.8
5           1  5.33 54.2
6           1  5.21 58.7
Y <- Boston$medv

fit2 <- lm.fit(X, Y)
coef(fit2)
(Intercept)       lstat         age 
33.22276053 -1.03206856  0.03454434 
fit3 <- lm(medv ~ ., Boston) # . means all variables in the data set except the response
summary(fit3)

Call:
lm(formula = medv ~ ., data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288    
chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229    
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(fit3)

fit4 <- lm(medv ~ . - age, Boston) # the "-" means "do not include a particular variable"
summary(fit4)

Call:
lm(formula = medv ~ . - age, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.6054  -2.7313  -0.5188   1.7601  26.2243 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.436927   5.080119   7.172 2.72e-12 ***
crim         -0.108006   0.032832  -3.290 0.001075 ** 
zn            0.046334   0.013613   3.404 0.000719 ***
indus         0.020562   0.061433   0.335 0.737989    
chas          2.689026   0.859598   3.128 0.001863 ** 
nox         -17.713540   3.679308  -4.814 1.97e-06 ***
rm            3.814394   0.408480   9.338  < 2e-16 ***
dis          -1.478612   0.190611  -7.757 5.03e-14 ***
rad           0.305786   0.066089   4.627 4.75e-06 ***
tax          -0.012329   0.003755  -3.283 0.001099 ** 
ptratio      -0.952211   0.130294  -7.308 1.10e-12 ***
black         0.009321   0.002678   3.481 0.000544 ***
lstat        -0.523852   0.047625 -10.999  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.74 on 493 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7343 
F-statistic: 117.3 on 12 and 493 DF,  p-value: < 2.2e-16

Let us see how the fitted values compare to the observed ones

y_hat <- as.numeric(predict(fit3, newdata = Boston))
plot(Boston$medv, y_hat, pch = 16, xlab = 'observed', ylab = 'fitted')
abline(a = 0, b = 1, lty = 2, lwd = 2, col = cols[1])

The MSE (mean squared error) is

mean((y_hat - Boston$medv)^2)
[1] 21.89483

Some things we have not considered here but that could be tried:

  • transformation of the outcome \(Y\)
  • transformation of the predictors \(\mathbf{X}\)
  • standardization of the predictors: this can help if they are on very different scales (a sketch follows below)
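
As an example of the last point, a quick sketch of standardizing the predictors before re-fitting the full model (medv is left on its original scale; note that this also rescales the dummy chas, which one may prefer to leave untouched):

Boston_std <- Boston
pred_cols <- names(Boston_std) != 'medv'
Boston_std[, pred_cols] <- scale(Boston_std[, pred_cols])   # center and scale each predictor
fit_std <- lm(medv ~ ., data = Boston_std)
coef(fit_std)   # slopes are now changes in medv per one-SD change in each predictor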


Nonlinear terms and interactions

fit5 <- lm(medv ~ lstat * age, Boston) # * expands to main effects plus the interaction term
summary(fit5)

Call:
lm(formula = medv ~ lstat * age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.806  -4.045  -1.333   2.085  27.552 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
age         -0.0007209  0.0198792  -0.036   0.9711    
lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.149 on 502 degrees of freedom
Multiple R-squared:  0.5557,    Adjusted R-squared:  0.5531 
F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16
fit6 <- lm(medv ~ lstat + I(lstat^2), Boston) # need to have I() to protect the square term
summary(fit6)

Call:
lm(formula = medv ~ lstat + I(lstat^2), data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2834  -3.8313  -0.5295   2.3095  25.4148 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 42.862007   0.872084   49.15   <2e-16 ***
lstat       -2.332821   0.123803  -18.84   <2e-16 ***
I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.524 on 503 degrees of freedom
Multiple R-squared:  0.6407,    Adjusted R-squared:  0.6393 
F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16
fit7 <- lm(medv ~ poly(lstat, 4, raw = T), Boston)
summary(fit7)

Call:
lm(formula = medv ~ poly(lstat, 4, raw = T), data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.563  -3.180  -0.632   2.283  27.181 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               5.731e+01  2.280e+00  25.134  < 2e-16 ***
poly(lstat, 4, raw = T)1 -7.028e+00  7.308e-01  -9.618  < 2e-16 ***
poly(lstat, 4, raw = T)2  4.955e-01  7.489e-02   6.616 9.50e-11 ***
poly(lstat, 4, raw = T)3 -1.631e-02  2.994e-03  -5.448 7.98e-08 ***
poly(lstat, 4, raw = T)4  1.949e-04  4.043e-05   4.820 1.90e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.28 on 501 degrees of freedom
Multiple R-squared:  0.673, Adjusted R-squared:  0.6704 
F-statistic: 257.8 on 4 and 501 DF,  p-value: < 2.2e-16
xgrid <- seq(min(Boston$lstat), max(Boston$lstat), length.out = 100)
y_hat6 <- as.matrix(model.matrix(~ xgrid + I(xgrid^2))) %*% coef(fit6)
y_hat7 <- as.matrix(model.matrix(~ poly(xgrid, 4, raw = T))) %*% coef(fit7)
par(mfrow = c(1,1))
plot(Boston$lstat, Boston$medv, pch = 16, xlab = 'lstat', ylab = 'medv')
lines(xgrid, y_hat6, col = cols[1], lwd = 3)
lines(xgrid, y_hat7, col = cols[2], lwd = 3)
legend('topright', col = cols[1:2], legend = c('Quadratic','Fourth degree'), lwd = 3)


Qualitative predictors

In greenmoving.txt there is a list of travel times (in minutes) as a function of the length of the route (in meters) and of the means of transportation used (walking, cycling, driving).

  1. First, estimate the parameters for a model that assumes a linear dependence between average travel time and length of the route for different means of transportation.

The means of transportation is a categorical variable with three levels: we need two dummy variables.

green <- read.table('./Data/greenmoving.txt', header=T)

# Let us transform the transport variable in a factor!
green$transport <- factor(green$transport)

walk_idx <- as.numeric(green$transport == 'walk')
bike_idx <- as.numeric(green$transport == 'bike')
car_idx <- as.numeric(green$transport == 'car')


ggplot() + 
  geom_point(aes(x = length, y = time, col = transport), data = green) + 
  labs(x = 'distance', y = 'time') + 
  scale_color_brewer(name = 'Transport', palette = 'Set1')

The reference is going to be bike (first in alphabetical order in the factor)!

\[time = \beta_0 + \beta_1*car + \beta_2*walk + \beta_3*length + \beta_4*length*car + \beta_5*length*walk + \varepsilon\]

This is a model with interactions! Therefore we have a slope and an intercept for every level of transport.

Bike regression line: \(\beta_0 + \beta_3*length + \varepsilon\)

Car regression line: \((\beta_0 + \beta_1) + (\beta_3 + \beta_4)*length + \varepsilon\)

Walk regression line: \((\beta_0 + \beta_2) + (\beta_3 + \beta_5)*length + \varepsilon\)

fit <- lm(time ~ car_idx + walk_idx + length + length:car_idx + length:walk_idx, data = green)
summary(fit)

Call:
lm(formula = time ~ car_idx + walk_idx + length + length:car_idx + 
    length:walk_idx, data = green)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.0179 -2.0579  0.0629  2.1336  6.3213 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.5667208  1.3487377   3.386 0.000915 ***
car_idx          1.2107320  1.6665020   0.727 0.468705    
walk_idx        -1.1597079  1.9965712  -0.581 0.562250    
length           0.0052521  0.0013307   3.947 0.000123 ***
car_idx:length  -0.0002884  0.0013484  -0.214 0.830913    
walk_idx:length  0.0065389  0.0018779   3.482 0.000659 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.135 on 144 degrees of freedom
Multiple R-squared:  0.899, Adjusted R-squared:  0.8955 
F-statistic: 256.4 on 5 and 144 DF,  p-value: < 2.2e-16
ggplot() + 
  geom_point(aes(x = length, y = time, col = transport), data = green) + 
  geom_abline(intercept = sum(coef(fit)[1]), slope = sum(coef(fit)[4]), col = cols[1]) + 
    geom_abline(intercept = sum(coef(fit)[1:2]), slope = sum(coef(fit)[4:5]), col = cols[2]) + 
  geom_abline(intercept = sum(coef(fit)[c(1,3)]), slope = sum(coef(fit)[c(4,6)]), col = cols[3]) + 
  labs(x = 'distance', y = 'time') + 
  scale_color_brewer(name = 'Transport', palette = 'Set1')

There is a much better way of doing this. Instead of creating the dummy variables by hand, we let R take care of it. We only have to declare that the variable transport is a factor!

fit <- lm(time ~ transport + length + length:transport, data = green)
summary(fit)

Call:
lm(formula = time ~ transport + length + length:transport, data = green)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.0179 -2.0579  0.0629  2.1336  6.3213 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           4.5667208  1.3487377   3.386 0.000915 ***
transportcar          1.2107320  1.6665020   0.727 0.468705    
transportwalk        -1.1597079  1.9965712  -0.581 0.562250    
length                0.0052521  0.0013307   3.947 0.000123 ***
transportcar:length  -0.0002884  0.0013484  -0.214 0.830913    
transportwalk:length  0.0065389  0.0018779   3.482 0.000659 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.135 on 144 degrees of freedom
Multiple R-squared:  0.899, Adjusted R-squared:  0.8955 
F-statistic: 256.4 on 5 and 144 DF,  p-value: < 2.2e-16

Some dummy variables do not seem to be significant. Let us verify this more precisely.

  2. Run two tests in order to assess if the means of transportation is significant, and then if the length of the route is significant.

The means of transportation is significant if AT LEAST one of the coefficients related to the dummy variables is significantly different from 0, i.e. the null hypothesis is \[\beta_{1} = \beta_{2} = \beta_{4} = \beta_{5} = 0.\]

rbind(c(0,1,0,0,0,0),
                       c(0,0,1,0,0,0),
                       c(0,0,0,0,1,0),
                       c(0,0,0,0,0,1))
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    1    0    0    0    0
[2,]    0    0    1    0    0    0
[3,]    0    0    0    0    1    0
[4,]    0    0    0    0    0    1
linearHypothesis(fit,
                 rbind(c(0,1,0,0,0,0),
                       c(0,0,1,0,0,0),
                       c(0,0,0,0,1,0),
                       c(0,0,0,0,0,1)),
                 c(0,0,0,0))
Linear hypothesis test

Hypothesis:
transportcar = 0
transportwalk = 0
transportcar:length = 0
transportwalk:length = 0

Model 1: restricted model
Model 2: time ~ transport + length + length:transport

  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    148 2606.3                                  
2    144 1415.0  4    1191.3 30.309 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Thus, it appears that the means of transportation does influence the response.

The length of the route is significant if AT LEAST one of the coefficients where it appears is significantly different from 0, i.e. the null hypothesis is \[\beta_{3} = \beta_{4} = \beta_{5} = 0.\]

linearHypothesis(fit,
                 rbind(c(0,0,0,1,0,0),
                       c(0,0,0,0,1,0),
                       c(0,0,0,0,0,1)),
                 c(0,0,0))
Linear hypothesis test

Hypothesis:
length = 0
transportcar:length = 0
transportwalk:length = 0

Model 1: restricted model
Model 2: time ~ transport + length + length:transport

  Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
1    147 7452.2                                 
2    144 1415.0  3    6037.2 204.8 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Thus, it appears that the length of the route does influence the response.

  3. Now, propose a reduced model on the basis of the previous tests.

Let’s look at the p-values for the individual tests!

summary(fit)

Call:
lm(formula = time ~ transport + length + length:transport, data = green)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.0179 -2.0579  0.0629  2.1336  6.3213 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           4.5667208  1.3487377   3.386 0.000915 ***
transportcar          1.2107320  1.6665020   0.727 0.468705    
transportwalk        -1.1597079  1.9965712  -0.581 0.562250    
length                0.0052521  0.0013307   3.947 0.000123 ***
transportcar:length  -0.0002884  0.0013484  -0.214 0.830913    
transportwalk:length  0.0065389  0.0018779   3.482 0.000659 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.135 on 144 degrees of freedom
Multiple R-squared:  0.899, Adjusted R-squared:  0.8955 
F-statistic: 256.4 on 5 and 144 DF,  p-value: < 2.2e-16

The summary suggests a single intercept and only two distinct slopes (walk vs. the other means of transportation). Let's test whether we can remove those three coefficients as a block, using the null hypothesis \[\beta_{1} = \beta_{2} = \beta_{5} = 0.\]

linearHypothesis(fit,
                 rbind(c(0,1,0,0,0,0),
                       c(0,0,1,0,0,0),
                       c(0,0,0,0,0,1)),
                 c(0,0,0))
Linear hypothesis test

Hypothesis:
transportcar = 0
transportwalk = 0
transportwalk:length = 0

Model 1: restricted model
Model 2: time ~ transport + length + length:transport

  Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
1    147 2296.7                                 
2    144 1415.0  3     881.7 29.91 4.315e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can reduce the model to \[time = \beta_0 + \beta_1*length + \beta_2*length*walk + \varepsilon\]

fit2 <- lm(time ~ length + length:walk_idx, data = green)
summary(fit2)

Call:
lm(formula = time ~ length + length:walk_idx, data = green)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.1640 -1.9820  0.0539  2.0756  6.4508 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     4.7579758  0.4581664   10.38   <2e-16 ***
length          0.0051614  0.0001438   35.90   <2e-16 ***
length:walk_idx 0.0054701  0.0004992   10.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.124 on 147 degrees of freedom
Multiple R-squared:  0.8976,    Adjusted R-squared:  0.8962 
F-statistic: 644.5 on 2 and 147 DF,  p-value: < 2.2e-16

The three regression lines have the same intercept: 4.758.

The bike regression line and the car regression line have the same slope: 0.0051614.

The walk regression line has slope 0.0051614 + 0.0054701 ≈ 0.0106.

Car and bike therefore have the same regression line.

ggplot() + 
  geom_point(aes(x = length, y = time, col = transport), data = green) + 
  geom_abline(intercept = sum(coef(fit2)[1]), slope = sum(coef(fit2)[2]), col = 'black') + 
  geom_abline(intercept = sum(coef(fit2)[1]), slope = sum(coef(fit2)[2:3]), col = cols[3]) + 
  labs(x = 'distance', y = 'time') + 
  scale_color_brewer(name = 'Transport', palette = 'Set1')


Optional: more elaborate models

The file smog.txt reports the concentrations of PM10 recorded by some weather stations in a town in Alaska between February 1 and February 10, 2010. The mayor of the town has forbidden the use of domestic heaters starting from February 7. The city council requests a statistical analysis to study the efficacy of the ban on heaters.

  1. Assuming that
  • the 50 observations are independent (no time dependence)
  • the residuals are independent normal realizations
  • the average PM10 depends only on the day and not on the weather station (no spatial dependence)

estimate the parameters of the following models (\(t = 1\) denotes February 1):

  • \(C^{1}\): mean concentration depends linearly on \(t\)
  • \(C^{0}\): mean concentration is piecewise linear in \(t\) (continuous with discontinuous derivative at \(t = 6.5\))
  • \(C^{-1}\): mean concentration is piecewise linear in \(t\) with a discontinuity at \(t = 6.5\)
smog <- read.table('./Data/smog.txt', header=T)
head(smog)
  day station   PM10
1   1       1 41.605
2   1       2 41.622
3   1       3 43.427
4   1       4 42.276
5   1       5 42.583
6   2       1 44.219

Let us try and write the possible models:

  • \(C^{1}\): this is the simple linear model \[Y = \beta_0 + \beta_1 t\]
  • \(C^{0}\): this is the linear model \[\begin{aligned} Y &= \beta_0 + \beta_1 t + \beta_2 (t - 6.5) * 1\{t > 6.5\} \\ &= \begin{cases} \beta_0 + \beta_1 t & \text{if $t \leq 6.5$} \\ (\beta_0 - 6.5 \beta_2) + (\beta_1 + \beta_2) t & \text{if $t > 6.5$} \end{cases} \end{aligned}\]
  • \(C^{-1}\): this is the linear model \[\begin{aligned} Y &= \beta_0 + \beta_1 t + \beta_2 * 1\{t > 6.5\} + \beta_3 (t - 6.5) * 1\{t > 6.5\} \\ &= \begin{cases} \beta_0 + \beta_1 t & \text{if $t \leq 6.5$} \\ (\beta_0 + \beta_2 - 6.5 \beta_3) + (\beta_1 + \beta_3) t & \text{if $t > 6.5$} \end{cases} \end{aligned}\]
intervention <- ifelse(smog$day > 6.5, 1, 0)

par(mar=c(4,4,2,2), family = 'serif')
plot(smog$day, smog$PM10, pch = 16, col = alpha('black', 0.5), 
     xlab = 'day', ylab = 'PM10')
abline(v = 6.5, lty = 3)

fit1 <- lm(PM10 ~ day, smog)
fit2 <- lm(PM10 ~ day + I((day - 6.5) * intervention), smog)
fit3 <- lm(PM10 ~ day + intervention + I((day - 6.5) * intervention), smog)

summary(fit1)

Call:
lm(formula = PM10 ~ day, data = smog)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4600 -0.8402 -0.1131  0.6881  3.5109 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 40.14209    0.38515  104.22   <2e-16 ***
day          2.07188    0.06207   33.38   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.261 on 48 degrees of freedom
Multiple R-squared:  0.9587,    Adjusted R-squared:  0.9578 
F-statistic:  1114 on 1 and 48 DF,  p-value: < 2.2e-16
summary(fit2)

Call:
lm(formula = PM10 ~ day + I((day - 6.5) * intervention), data = smog)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2468 -1.0178  0.0062  0.7391  3.4297 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    39.6276     0.4811  82.364   <2e-16 ***
day                             2.2314     0.1107  20.150   <2e-16 ***
I((day - 6.5) * intervention)  -0.4539     0.2632  -1.724   0.0912 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.236 on 47 degrees of freedom
Multiple R-squared:  0.9612,    Adjusted R-squared:  0.9595 
F-statistic: 581.5 on 2 and 47 DF,  p-value: < 2.2e-16
summary(fit3)

Call:
lm(formula = PM10 ~ day + intervention + I((day - 6.5) * intervention), 
    data = smog)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.33462 -0.67789 -0.03462  0.67975  2.63922 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    40.3889     0.4120  98.038  < 2e-16 ***
day                             1.9303     0.1058  18.247  < 2e-16 ***
intervention                    3.0407     0.5822   5.223 4.15e-06 ***
I((day - 6.5) * intervention)  -0.8555     0.2244  -3.812 0.000408 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9895 on 46 degrees of freedom
Multiple R-squared:  0.9756,    Adjusted R-squared:  0.974 
F-statistic: 613.4 on 3 and 46 DF,  p-value: < 2.2e-16
day_grid <- seq(min(smog$day), max(smog$day), length.out = 200)
intervention_grid <- ifelse(day_grid > 6.5, 1, 0)
  
lines(day_grid, predict(fit1, newdata = list(day = day_grid)), col = cols[1], pch = 16, lwd = 2)
lines(day_grid, predict(fit2, newdata = list(day = day_grid, intervention = intervention_grid)), col = cols[2], pch = 16, lwd = 2)
lines(day_grid, predict(fit3, newdata = list(day = day_grid, intervention = intervention_grid)), col = cols[3], pch = 16, lwd = 2)
legend('topleft', legend = c(TeX('$C^1$'),TeX('$C^0$'),TeX('$C^{-1}$')), col = cols[1:3], lwd = 2)

  2. Using model \(C^{0}\), is there statistical evidence to say that the ban has reduced the speed of mean growth of pollution?
summary(fit2)

Call:
lm(formula = PM10 ~ day + I((day - 6.5) * intervention), data = smog)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2468 -1.0178  0.0062  0.7391  3.4297 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    39.6276     0.4811  82.364   <2e-16 ***
day                             2.2314     0.1107  20.150   <2e-16 ***
I((day - 6.5) * intervention)  -0.4539     0.2632  -1.724   0.0912 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.236 on 47 degrees of freedom
Multiple R-squared:  0.9612,    Adjusted R-squared:  0.9595 
F-statistic: 581.5 on 2 and 47 DF,  p-value: < 2.2e-16
  3. In particular, the mayor says that his action halved the speed of mean growth of pollution starting from February 7. Is there enough evidence to prove his statement wrong?

Remember the model: the mayor's claim corresponds to the hypothesis that the slope after the intervention, \(\beta_1 + \beta_2\), equals half the slope before it, \(\frac{1}{2} \beta_1\).

This corresponds to \(H_0: \beta_1 + 2 \beta_2 = 0\).

linearHypothesis(fit2, 
                 rbind(c(0,1,2)), 
                 0)
Linear hypothesis test

Hypothesis:
day  + 2 I((day - 6.5) * intervention) = 0

Model 1: restricted model
Model 2: PM10 ~ day + I((day - 6.5) * intervention)

  Res.Df    RSS Df Sum of Sq     F   Pr(>F)   
1     48 85.678                               
2     47 71.749  1    13.929 9.124 0.004071 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can also calculate the confidence interval for a linear combination of the parameters:

library(multcomp)
hyp_matrix <- array(0, dim = c(2, length(coef(fit2)))) # 2 is the number of CI to compute, length(coef(fit2)) is the model dimension
hyp_matrix[1,] <- c(0,1,0)
hyp_matrix[2,] <- c(0,1,1)
confcm <- glht(fit2, linfct = hyp_matrix)
confcm <- confint(confcm)

plotCI(x = 1:2, y = confcm$confint[,1], ui = confcm$confint[,3], 
       li = confcm$confint[,2], pch = 21, lwd = 2, xaxt = 'n', xlab = '', 
       ylab = 'value', col = cols[1])
axis(1, at = 1:2, labels = c(TeX('$\\beta_1$'), TeX('$\\beta_1 + \\beta_2$')))